This document demonstrates a complete single-cell RNA-seq analysis workflow using the PBMC dataset with Seurat V5. The analysis includes:
We load the 10X Genomics data using the Read10X
function.
# Function: Read10X()
# Input:
# - genes.tsv (gene names, character vector)
# - barcodes.tsv (cell barcodes, character vector)
# - matrix.mtx (expression counts, sparse matrix format)
# Output:
# - pbmc.data: dgCMatrix (sparse matrix)
# - Dimensions: 32738 genes (rows) × 2700 cells (columns)
# - Data type: Integer counts (non-negative)
pbmc.data <- Read10X(data.dir = "/Users/outobi/Documents/Learn bioinformatics /scRNA by Seurat V5/filtered_gene_bc_matrices/hg19/")We create a Seurat object with basic filtering: -
min.cells = 3: Keep genes detected in at least 3 cells -
min.features = 200: Keep cells with at least 200 detected
genes
# Function: CreateSeuratObject()
# Input:
# - counts: sparse matrix (32738 genes × 2700 cells, integer counts)
# - min.cells: numeric (3)
# - min.features: numeric (200)
# Output:
# - pbmc: Seurat object
# - Filtered to 13714 features × 2700 cells
# - meta.data: data.frame with 3 columns (orig.ident, nCount_RNA, nFeature_RNA)
# - orig.ident: character (project name)
# - nCount_RNA: numeric (total counts per cell)
# - nFeature_RNA: numeric (number of detected genes per cell)
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc## An object of class Seurat
## 13714 features across 2700 samples within 1 assay
## Active assay: RNA (13714 features, 0 variable features)
## 1 layer present: counts
The resulting object contains 13,714 features across 2,700 samples within 1 assay.
pbmc
├── assays
│ └── RNA
│ └── layers
│ └── counts (13714 genes × 2700 cells)
├── meta.data (cell-level metadata: orig.ident, nCount_RNA, nFeature_RNA)
└── project: pbmc3k
Let’s look at the expression of some marker genes (CD3D, TCL1A, MS4A1) across the first 30 cells:
## 3 x 30 sparse Matrix of class "dgCMatrix"
##
## CD3D 4 . 10 . . 1 2 3 1 . . 2 7 1 . . 1 3 . 2 3 . . . . . 3 4 1 5
## TCL1A . . . . . . . . 1 . . . . . . . . . . . . 1 . . . . . . . .
## MS4A1 . 6 . . . . . . 1 1 1 . . . . . . . . . 36 1 2 . . 2 . . . .
The [[ operator can add columns to object metadata
directly! This is a great place to stash QC stats. We calculate the
percentage of mitochondrial gene expression, which is an indicator of
cell quality. High mitochondrial content often indicates dying
cells.
# Function: PercentageFeatureSet()
# Input:
# - pbmc: Seurat object (13714 genes × 2700 cells)
# - pattern: character ("^MT-", regex for mitochondrial genes)
# Output:
# - percent.mt: numeric vector (length 2700, one value per cell)
# - Data type: numeric (percentage, 0-100)
# - Added to pbmc@meta.data as new column
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
# View assays and metadata
pbmc@assays## $RNA
## Assay (v5) data with 13714 features for 2700 cells
## First 10 features:
## AL627309.1, AP006222.2, RP11-206L10.2, RP11-206L10.9, LINC00115, NOC2L,
## KLHL17, PLEKHN1, RP11-54O7.17, HES4
## Layers:
## counts
## orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACATACAACCAC-1 pbmc3k 2419 779 3.0177759
## AAACATTGAGCTAC-1 pbmc3k 4903 1352 3.7935958
## AAACATTGATCAGC-1 pbmc3k 3147 1129 0.8897363
## AAACCGTGCTTCCG-1 pbmc3k 2639 960 1.7430845
## AAACCGTGTATGCG-1 pbmc3k 980 521 1.2244898
## AAACGCACTGGTAC-1 pbmc3k 2163 781 1.6643551
Examine relationships between QC metrics:
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2Based on the QC metrics, we filter cells to retain high-quality cells: - Between 200 and 2500 detected genes - Less than 5% mitochondrial content
# Function: subset()
# Input:
# - pbmc: Seurat object (13714 genes × 2700 cells)
# - subset: logical expression (filtering criteria)
# Output:
# - pbmc: Seurat object (filtered)
# - Dimensions: 13714 genes × 2638 cells (62 cells removed)
# - Retains only high-quality cells
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
pbmc## An object of class Seurat
## 13714 features across 2638 samples within 1 assay
## Active assay: RNA (13714 features, 0 variable features)
## 1 layer present: counts
pbmc
├── assays
│ └── RNA
│ └── layers
│ └── counts (filtered cells)
├── meta.data (cell-level metadata: orig.ident, nCount_RNA, nFeature_RNA, percent.mt)
└── project: pbmc3k
We normalize the data using log-normalization with a scale factor of 10,000. This accounts for differences in sequencing depth between cells.
# Function: NormalizeData()
# Input:
# - pbmc@assays$RNA@layers$counts: sparse matrix (13714 genes × 2638 cells, integer counts)
# - normalization.method: character ("LogNormalize")
# - scale.factor: numeric (10000)
# Output:
# - pbmc@assays$RNA@layers$data: sparse matrix (13714 genes × 2638 cells)
# - Data type: numeric (log-normalized values)
# - Formula: log1p(counts / total_counts * 10000)
# - Values typically range from 0 to ~10
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)pbmc
├── assays
│ └── RNA
│ └── layers
│ ├── counts (raw counts)
│ └── data (normalized data)
├── meta.data
└── project: pbmc3k
We identify highly variable genes that show high cell-to-cell variation. These genes are informative for downstream analysis.
# Function: FindVariableFeatures()
# Input:
# - pbmc@assays$RNA@layers$data
# Sparse matrix (13714 genes × 2638 cells) of log-normalized counts
# - selection.method = "vst"
# Variance-stabilizing transformation used to model mean–variance relationship
# - nfeatures = 2000
# Number of most variable genes to retain
#
# Output:
# - pbmc@assays$RNA@meta.data (feature-level metadata; 13714 rows × 8 columns)
# * vf_vst_counts_mean
# Mean expression of each gene across all cells (on vst input scale)
# * vf_vst_counts_variance
# Observed variance of expression for each gene
# * vf_vst_counts_variance.expected
# Expected variance at that mean, from the fitted mean–variance curve
# * vf_vst_counts_variance.standardized
# Standardized variance = (observed / expected); measures “extra” variability
# * vf_vst_counts_variable
# Logical flag; TRUE if the gene is selected as a highly variable feature
# * vf_vst_counts_rank
# Rank of each gene by standardized variance (1 = most variable)
# * var.features
# Gene name if it is a variable feature, otherwise NA (redundant helper column)
# * var.features.rank
# Rank among variable features (usually matches vf_vst_counts_rank for HVGs)
#
# - Variable features:
# VariableFeatures(pbmc) returns the top nfeatures (e.g., 2000) genes
# where vf_vst_counts_variable == TRUE.
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# top 10 highly variable gene
VariableFeatures(pbmc)[1:10]## [1] "PPBP" "LYZ" "S100A9" "IGLL5" "GNLY" "FTL" "PF4" "FTH1"
## [9] "GNG11" "S100A8"
pbmc
├── assays
│ └── RNA
│ ├── layers
│ │ ├── counts
│ │ └── data
│ └── meta.data (vst.mean, vst.variance, vst.variance.expected, vst.variance.standardized, )
└── meta.data
We scale the data to give equal weight to each gene in downstream analyses. This centers the expression of each gene to mean 0 and scales to unit variance.
# Function: ScaleData()
# Input:
# - pbmc@assays$RNA@layers$data: sparse matrix (13714 genes × 2638 cells, log-normalized)
# - features: character vector (13714 gene names, all genes)
# Output:
# - pbmc@assays$RNA@layers$scale.data: dense matrix (13714 genes × 2638 cells)
# - Data type: numeric (scaled values)
# - For each gene: mean = 0, standard deviation = 1
# - Values typically range from -3 to +3 (z-scores)
# - Note: Converts from sparse to dense matrix
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)pbmc
├── assays
│ └── RNA
│ ├── layers
│ │ ├── counts # raw counts (sparse)
│ │ └── data # log-normalized expression
│ └── meta.data # feature-level metadata (13714 genes × 8 columns)
│ ├── vf_vst_counts_mean # mean expression
│ ├── vf_vst_counts_variance # observed variance
│ ├── vf_vst_counts_variance.expected # expected variance from mean–variance fit
│ ├── vf_vst_counts_variance.standardized# standardized variance (extra variability)
│ ├── vf_vst_counts_variable # TRUE/FALSE: selected as HVG
│ ├── vf_vst_counts_rank # rank by variability (1 = most variable)
│ ├── var.features # gene name if HVG, else NA
│ └── var.features.rank # rank among variable features
└── meta.data
## Counts layer (first 5x5):
## 5 x 5 sparse Matrix of class "dgCMatrix"
##
## [1,] . . . . .
## [2,] . . . . .
## [3,] . . . . .
## [4,] . . . . .
## [5,] . . . . .
##
## Normalized data layer (first 5x5):
## 5 x 5 sparse Matrix of class "dgCMatrix"
##
## [1,] . . . . .
## [2,] . . . . .
## [3,] . . . . .
## [4,] . . . . .
## [5,] . . . . .
##
## Scaled data layer (first 5x5):
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.05812316 -0.05812316 -0.05812316 -0.05812316 -0.05812316
## [2,] -0.03357571 -0.03357571 -0.03357571 -0.03357571 -0.03357571
## [3,] -0.04166819 -0.04166819 -0.04166819 -0.04166819 -0.04166819
## [4,] -0.03364562 -0.03364562 -0.03364562 -0.03364562 -0.03364562
## [5,] -0.08223981 -0.08223981 -0.08223981 -0.08223981 -0.08223981
We perform PCA on the scaled data using only the variable features (only in selected most variable features after scale). This reduces dimensionality while retaining the most important sources of variation.
# Function: RunPCA()
# Input:
# - pbmc@assays$RNA@layers$scale.data: matrix (13714 genes × 2638 cells, scaled)
# - features: character vector (2000 variable genes)
# Output:
# - pbmc@reductions$pca@cell.embeddings: matrix (2638 cells × 50 PCs)
# - Data type: numeric (PCA scores, cell coordinates in PC space)
# - Each row is a cell, each column is a principal component
# - pbmc@reductions$pca@feature.loadings: matrix (2000 genes × 50 PCs)
# - Data type: numeric (gene loadings/weights)
# - Each row is a gene, each column is a principal component
# - pbmc@reductions$pca@stdev: numeric vector (50 values, standard deviation of each PC)
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))pbmc
├── assays
│ └── RNA
│ └── layers (counts, data, scale.data)
├── meta.data
└── reductions
└── pca
├── cell.embeddings (PCA scores: 2638 cells × 50 PCs)
└── feature.loadings (PCA loadings: 2000 genes × 50 PCs)
# Print the genes with highest loadings for the first 5 PCs
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)## PC_ 1
## Positive: CST3, TYROBP, LST1, AIF1, FTL
## Negative: MALAT1, LTB, IL32, IL7R, CD2
## PC_ 2
## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1
## Negative: NKG7, PRF1, CST7, GZMB, GZMA
## PC_ 3
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1
## Negative: PPBP, PF4, SDPR, SPARC, GNG11
## PC_ 4
## Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1
## Negative: VIM, IL7R, S100A6, IL32, S100A8
## PC_ 5
## Positive: GZMB, NKG7, S100A8, FGFBP2, GNLY
## Negative: LTB, IL7R, CKB, VIM, MS4A7
The Seurat object now contains:
pbmc
├── assays
│ └── RNA (counts, data, scale.data layers)
├── meta.data (cell-level metadata)
├── reductions
│ └── pca
│ ├── cell.embeddings (PCA scores: 2638 cells × 50 PCs)
│ └── feature.loadings (PCA loadings: 2000 genes × 50 PCs)
└── ...
Visualize loadings for PC1 and PC2:
Visualize loadings for PC1 only:
Use an elbow plot to determine the number of PCs to use for downstream analysis:
We chose 10 here, but encourage users to consider the following:
The clustering workflow involves two main steps:
# Function: FindNeighbors()
# Input:
# - pbmc@reductions$pca@cell.embeddings: matrix (2638 cells × 50 PCs)
# - dims: numeric vector (1:10, use first 10 PCs)
# Output:
# - pbmc@graphs$RNA_nn: sparse matrix (2638 × 2638, KNN graph)
# - Data type: numeric, values 0 or 1 (binary adjacency)
# - pbmc@graphs$RNA_snn: sparse matrix (2638 × 2638, SNN graph)
# - Data type: numeric, values 0 to 1 (weighted edges, Jaccard similarity)
# Function: FindClusters()
# Input:
# - pbmc@graphs$RNA_snn: sparse matrix (2638 × 2638)
# - resolution: numeric (0.5)
# Output:
# - pbmc@meta.data$seurat_clusters: factor (2638 values, cluster IDs: 0-8)
# - pbmc@active.ident: factor (2638 values, same as seurat_clusters)
# - Data type: categorical (9 clusters numbered 0 to 8)
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2638
## Number of edges: 95927
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8728
## Number of communities: 9
## Elapsed time: 0 seconds
pbmc
├── assays
│ └── RNA
├── meta.data (added: seurat_clusters)
├── reductions
│ └── pca
├── graphs
│ ├── RNA_nn (KNN graph: 2638 × 2638 sparse matrix)
│ └── RNA_snn (SNN graph: 2638 × 2638 sparse matrix)
└── active.ident (cluster assignments)
## [1] "RNA_nn" "RNA_snn"
## A Graph object containing 2638 cells
# KNN graph: 2638 * 2638 0 or 1 matrix (1 is in this cell k closest distance)
RNA_nn <- pbmc@graphs$RNA_nn
dim(RNA_nn) # n_cells x n_cells## [1] 2638 2638
## 5 x 5 sparse Matrix of class "dgCMatrix"
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1
## AAACATACAACCAC-1 1 . .
## AAACATTGAGCTAC-1 . 1 .
## AAACATTGATCAGC-1 . . 1
## AAACCGTGCTTCCG-1 . . .
## AAACCGTGTATGCG-1 . . .
## AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1
## AAACATACAACCAC-1 . .
## AAACATTGAGCTAC-1 . .
## AAACATTGATCAGC-1 . .
## AAACCGTGCTTCCG-1 1 .
## AAACCGTGTATGCG-1 . 1
# SNN graph: 2638 * 2638 0 to 1 matrix (quantify their similarity or overlap in knn)
RNA_snn <- pbmc@graphs$RNA_snn
dim(RNA_snn)## [1] 2638 2638
## 10 x 10 sparse Matrix of class "dgCMatrix"
##
## AAACATACAACCAC-1 1.0000000 . . . . . 0.1111111 . . .
## AAACATTGAGCTAC-1 . 1 . . . . . . . .
## AAACATTGATCAGC-1 . . 1 . . . . . . .
## AAACCGTGCTTCCG-1 . . . 1 . . . . . .
## AAACCGTGTATGCG-1 . . . . 1 . . . . .
## AAACGCACTGGTAC-1 . . . . . 1 . . . .
## AAACGCTGACCAGT-1 0.1111111 . . . . . 1.0000000 0.1111111 . .
## AAACGCTGGTTCTT-1 . . . . . . 0.1111111 1.0000000 . .
## AAACGCTGTAGCCA-1 . . . . . . . . 1 .
## AAACGCTGTTTCTG-1 . . . . . . . . . 1
## [1] 0.08108108 1.00000000
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
## 2 3 2 1
## AAACCGTGTATGCG-1
## 6
## Levels: 0 1 2 3 4 5 6 7 8
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
## 2 3 2 1
## AAACCGTGTATGCG-1
## 6
## Levels: 0 1 2 3 4 5 6 7 8
Run UMAP for 2D visualization of the data.
# Function: RunUMAP()
# Input:
# - pbmc@reductions$pca@cell.embeddings: matrix (2638 cells × 50 PCs)
# - dims: numeric vector (1:10, use first 10 PCs)
# Output:
# - pbmc@reductions$umap@cell.embeddings: matrix (2638 cells × 2 dimensions)
# - Data type: numeric (UMAP coordinates)
# - Column names: UMAP_1, UMAP_2
# - Values typically range from -10 to +10
# - Each row represents a cell's 2D position for visualization
pbmc <- RunUMAP(pbmc, dims = 1:10)pbmc
├── assays
│ └── RNA
├── meta.data (seurat_clusters)
├── reductions
│ ├── pca
│ └── umap
│ └── cell.embeddings (UMAP coordinates: 2638 cells × 2 dimensions)
├── graphs (RNA_nn, RNA_snn)
└── active.ident
Find all marker genes upregulated in cluster 2 compared to all other cells:
# Function: FindMarkers()
# Input:
# - pbmc@assays$RNA@layers$data: sparse matrix (13714 genes × 2638 cells, log-normalized)
# - pbmc@active.ident: factor (cluster assignments)
# - ident.1: numeric or character (cluster 2)
# Output:
# - cluster2.markers: data.frame (genes × 5 columns)
# - p_val: numeric (p-value from differential expression test)
# - avg_log2FC: numeric (average log2 fold change)
# - pct.1: numeric (% of cells in cluster 2 expressing the gene)
# - pct.2: numeric (% of cells in other clusters expressing the gene)
# - p_val_adj: numeric (adjusted p-value)
# - Rows sorted by p_val (most significant first)
cluster2.markers <- FindMarkers(pbmc, ident.1 = 2)
head(cluster2.markers, n = 5)## p_val avg_log2FC pct.1 pct.2 p_val_adj
## IL32 2.892340e-90 1.3070772 0.947 0.465 3.966555e-86
## LTB 1.060121e-86 1.3312674 0.981 0.643 1.453850e-82
## CD3D 8.794641e-71 1.0597620 0.922 0.432 1.206097e-66
## IL7R 3.516098e-68 1.4377848 0.750 0.326 4.821977e-64
## LDHB 1.642480e-67 0.9911924 0.954 0.614 2.252497e-63
Find markers distinguishing cluster 5 from specific other clusters:
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## FCGR3A 8.246578e-205 6.794969 0.975 0.040 1.130936e-200
## IFITM3 1.677613e-195 6.192558 0.975 0.049 2.300678e-191
## CFD 2.401156e-193 6.015172 0.938 0.038 3.292945e-189
## CD68 2.900384e-191 5.530330 0.926 0.035 3.977587e-187
## RP11-290F20.3 2.513244e-186 6.297999 0.840 0.017 3.446663e-182
Find marker genes for every cluster compared to all remaining cells (positive markers only):
# Function: FindAllMarkers()
# Input:
# - pbmc@assays$RNA@layers$data: sparse matrix (13714 genes × 2638 cells, log-normalized)
# - pbmc@active.ident: factor (9 clusters: 0-8)
# - only.pos: logical (TRUE, return only upregulated genes)
# Output:
# - pbmc.markers: data.frame (genes × 7 columns)
# - p_val: numeric (p-value)
# - avg_log2FC: numeric (average log2 fold change)
# - pct.1: numeric (% cells in cluster expressing gene)
# - pct.2: numeric (% cells in other clusters expressing gene)
# - p_val_adj: numeric (adjusted p-value)
# - cluster: factor (cluster ID, 0-8)
# - gene: character (gene name)
# - Contains markers for all 9 clusters
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE)
# Filter for genes with strong upregulation (log2FC > 1)
pbmc.markers %>%
group_by(cluster) %>%
dplyr::filter(avg_log2FC > 1)## # A tibble: 7,019 × 7
## # Groups: cluster [9]
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 3.75e-112 1.21 0.912 0.592 5.14e-108 0 LDHB
## 2 9.57e- 88 2.40 0.447 0.108 1.31e- 83 0 CCR7
## 3 1.15e- 76 1.06 0.845 0.406 1.58e- 72 0 CD3D
## 4 1.12e- 54 1.04 0.731 0.4 1.54e- 50 0 CD3E
## 5 1.35e- 51 2.14 0.342 0.103 1.86e- 47 0 LEF1
## 6 1.94e- 47 1.20 0.629 0.359 2.66e- 43 0 NOSIP
## 7 2.81e- 44 1.53 0.443 0.185 3.85e- 40 0 PIK3IP1
## 8 6.27e- 43 1.99 0.33 0.112 8.60e- 39 0 PRKCQ-AS1
## 9 1.16e- 40 2.70 0.2 0.04 1.59e- 36 0 FHIT
## 10 1.34e- 34 1.96 0.268 0.087 1.84e- 30 0 MAL
## # ℹ 7,009 more rows
Seurat has several tests for differential expression which can be set with the test.use parameter (see Seurat DE vignette for details). For example, the ROC test returns the ‘classification power’ for any individual marker (ranging from 0 - random, to 1 - perfect).
cluster0.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25,
test.use = "roc", only.pos = TRUE)
head(cluster0.markers)## myAUC avg_diff power avg_log2FC pct.1 pct.2
## RPS12 0.827 0.5059247 0.654 0.7387061 1.000 0.991
## RPS6 0.826 0.4762402 0.652 0.6934523 1.000 0.995
## RPS27 0.824 0.5047203 0.648 0.7372604 0.999 0.992
## RPL32 0.821 0.4294911 0.642 0.6266075 0.999 0.995
## RPS14 0.811 0.4334133 0.622 0.6336957 1.000 0.994
## RPS25 0.803 0.5196163 0.606 0.7689940 0.997 0.975
Extract the top 10 marker genes per cluster and visualize in a heatmap:
# Extract top 10 genes per cluster
# 1. group_by(cluster)
# → "Split" the table by cluster.
# Now all subsequent operations happen within each cluster separately.
# 2. filter(avg_log2FC > 1)
# → Keep only genes that are strongly upregulated in that cluster
# (log2 fold change > 1 means at least ~2× higher expression).
# 3. slice_head(n = 10)
# → For each cluster, keep the first 10 rows (after filtering).
# Since FindAllMarkers usually sorts by p_val or p_val_adj, this means:
# "Take the top 10 strongest / most significant marker genes per cluster."
# 4. ungroup()
# → Remove the grouping so top10 is now just a normal tibble.
# 5. -> top10
# → Save the result as a new object top10.
pbmc.markers %>%
group_by(cluster) %>%
dplyr::filter(avg_log2FC > 1) %>%
slice_head(n = 10) %>%
ungroup() -> top10# Draw a heatmap of the top 10 upregulated marker genes for each cluster across all cells.
DoHeatmap(pbmc, features = top10$gene) + NoLegend()Ridge plots show the distribution of expression across clusters:
Examine feature correlation among all cells:
Cell correlation among variable features:
CellScatter(
pbmc,
cell1 = colnames(pbmc)[1],
cell2 = colnames(pbmc)[2],
features = VariableFeatures(pbmc)
)Classical bubble plot of marker genes. Dot plots are a concise way to visualize marker gene expression across clusters. The size of the dot represents the percentage of cells expressing the gene, and the color intensity represents the average expression level.
markers.to.plot <- c("MS4A1", "CD79A", # B cells
"CD3D", "IL7R", # CD4 T
"LTB", "NKG7") # others
DotPlot(pbmc, features = markers.to.plot) + RotatedAxis()Based on canonical marker genes, we assign cell type identities to each cluster:
# Function: RenameIdents()
# Input:
# - pbmc@active.ident: factor (9 levels: 0-8, numeric cluster IDs)
# - new.cluster.ids: named character vector (9 cell type labels)
# Output:
# - pbmc@active.ident: factor (9 levels, cell type names)
# - Data type: categorical (character labels)
# - Levels: "Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T",
# "FCGR3A+ Mono", "NK", "DC", "Platelet"
# - Note: Original numeric clusters preserved in pbmc@meta.data$seurat_clusters
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T",
"FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)pbmc
├── assays
│ └── RNA (counts, data, scale.data layers)
├── meta.data (seurat_clusters, nCount_RNA, nFeature_RNA, percent.mt)
├── reductions
│ ├── pca (50 PCs)
│ └── umap (2D coordinates)
├── graphs (RNA_nn, RNA_snn)
└── active.ident (cell type labels: Naive CD4 T, CD14+ Mono, Memory CD4 T, etc.)
Save the final Seurat object for future use (R Data Serialization, one object only):
## R version 4.4.3 (2025-02-28)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.6.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] future_1.67.0 patchwork_1.3.2 Seurat_5.3.1 SeuratObject_5.2.0
## [5] sp_2.2-0 dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] deldir_2.0-4 pbapply_1.7-4 gridExtra_2.3
## [4] rlang_1.1.6 magrittr_2.0.4 RcppAnnoy_0.0.22
## [7] otel_0.2.0 spatstat.geom_3.6-0 matrixStats_1.5.0
## [10] ggridges_0.5.7 compiler_4.4.3 png_0.1-8
## [13] vctrs_0.6.5 reshape2_1.4.4 stringr_1.6.0
## [16] pkgconfig_2.0.3 fastmap_1.2.0 labeling_0.4.3
## [19] utf8_1.2.6 promises_1.5.0 rmarkdown_2.30
## [22] purrr_1.2.0 xfun_0.54 cachem_1.1.0
## [25] jsonlite_2.0.0 goftest_1.2-3 later_1.4.4
## [28] spatstat.utils_3.2-0 irlba_2.3.5.1 parallel_4.4.3
## [31] cluster_2.1.8.1 R6_2.6.1 ica_1.0-3
## [34] bslib_0.9.0 stringi_1.8.7 RColorBrewer_1.1-3
## [37] spatstat.data_3.1-9 limma_3.62.2 reticulate_1.44.0
## [40] parallelly_1.45.1 spatstat.univar_3.1-4 lmtest_0.9-40
## [43] jquerylib_0.1.4 scattermore_1.2 Rcpp_1.1.0
## [46] knitr_1.50 tensor_1.5.1 future.apply_1.20.0
## [49] zoo_1.8-14 R.utils_2.13.0 sctransform_0.4.2
## [52] httpuv_1.6.16 Matrix_1.7-4 splines_4.4.3
## [55] igraph_2.2.1 tidyselect_1.2.1 abind_1.4-8
## [58] rstudioapi_0.17.1 yaml_2.3.10 spatstat.random_3.4-2
## [61] codetools_0.2-20 miniUI_0.1.2 spatstat.explore_3.5-3
## [64] listenv_0.10.0 lattice_0.22-7 tibble_3.3.0
## [67] plyr_1.8.9 withr_3.0.2 shiny_1.11.1
## [70] S7_0.2.0 ROCR_1.0-11 evaluate_1.0.5
## [73] Rtsne_0.17 fastDummies_1.7.5 survival_3.8-3
## [76] polyclip_1.10-7 fitdistrplus_1.2-4 pillar_1.11.1
## [79] KernSmooth_2.23-26 plotly_4.11.0 generics_0.1.4
## [82] RcppHNSW_0.6.0 ggplot2_4.0.0 scales_1.4.0
## [85] globals_0.18.0 xtable_1.8-4 glue_1.8.0
## [88] lazyeval_0.2.2 tools_4.4.3 data.table_1.17.8
## [91] RSpectra_0.16-2 RANN_2.6.2 dotCall64_1.2
## [94] cowplot_1.2.0 grid_4.4.3 tidyr_1.3.1
## [97] nlme_3.1-168 cli_3.6.5 spatstat.sparse_3.1-0
## [100] spam_2.11-1 viridisLite_0.4.2 uwot_0.2.3
## [103] gtable_0.3.6 R.methodsS3_1.8.2 sass_0.4.10
## [106] digest_0.6.37 progressr_0.17.0 ggrepel_0.9.6
## [109] htmlwidgets_1.6.4 farver_2.1.2 R.oo_1.27.1
## [112] htmltools_0.5.8.1 lifecycle_1.0.4 httr_1.4.7
## [115] statmod_1.5.1 mime_0.13 MASS_7.3-65